Abstract

The gist of many (NP-)hard combinatorial problems is to decide whether a universe 
of n elements contains a witness consisting of k elements that match some prescribed 
pattern. For some of these problems there are known advanced algebra-based FPT 
algorithms which solve the decision problem but do not return the witness. We investigate 
techniques for turning such a YES/NO-decision oracle into an algorithm for extracting a 
single witness, with an objective to obtain practical scalability for large values of n. By 
relying on techniques from combinatorial group testing, we demonstrate that a witness 
may be extracted with $O(k \log n)$ queries to either a deterministic or a randomized set
inclusion oracle with one-sided probability of error. Furthermore, we demonstrate through
implementation and experiments that the algebra-based FPT algorithms are practical, in
particular in the setting of the $k$-path problem. Also discussed are engineering issues such
as optimizing finite field arithmetic. 


1 Introduction 

The gist of many (NP-)hard combinatorial problems is to decide whether a universe of n 
elements contains a witness consisting of k elements that match some prescribed pattern. In 
the positive case this is naturally followed by the task of extracting the elements of one such 
witness. 

As a result of advances in fixed-parameter tractability, many such hard problems are now 
known to admit algorithms that run in linear (or low-order polynomial) time in the size of the 
universe n, and where the complexity of the problem can be isolated to the size of the witness 
k. That is, the running times obtained are of the form $O(f(k) \cdot n)$ for some rapidly growing
function f(k) of k. This makes such algorithms ideal candidates for practical applications
that must consider large inputs, that is, large values of n. For example, a recent randomized
algorithm for the $k$-sized graph motif problem runs in time $O(2^k k^2 (\log k)^2 \cdot e)$, where e is the
number of edges in the input graph [3].

Despite scalability to large inputs, some such advanced parameterized algorithms (like
the ones for graph motif [3] or for $k$-path [1]) have an inherent handicap from a concrete
algorithm engineering perspective. They only solve the decision problem. In applications,
however, one needs access to the witnesses, which puts forth the question whether one can 
efficiently extract a witness or list all witnesses, using the algorithm for the decision problem 
as an oracle (black-box subroutine), and without losing the scalability to large inputs. 

This paper studies the question of efficiently turning a decision oracle into an algorithm for 
witness extraction over the universe $U = \{1, 2, \ldots, n\}$. Let $\mathcal{F} \subseteq 2^U$ be the (unknown) family
of witnesses. We focus on the following oracle: 

Inclusion oracle. Given a query set $Y \subseteq U$, the oracle answers (either YES or NO) whether
there exists at least one witness $W \in \mathcal{F}$ such that $W \subseteq Y$. We can motivate this type of
oracle by observing that most problems have natural self-reducibility that we can use to
narrow down the universe from U to Y (e.g. take the subgraph induced by the set Y of
vertices) and then run the decision algorithm.
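As an illustration of this self-reducibility, the following minimal Python sketch wraps a decision procedure into an inclusion oracle over the vertex set by restricting the graph to the induced subgraph. The names `has_k_path` and `make_vertex_inclusion_oracle` are hypothetical, and the brute-force search is only a stand-in for a real FPT decision algorithm, not the algebraic algorithm used in this paper.

```python
def has_k_path(vertices, edges, k):
    """Brute-force stand-in for a k-path decision algorithm (illustration only).

    Returns True iff the graph (vertices, edges) has a simple path on k vertices.
    All edge endpoints are assumed to belong to `vertices`.
    """
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def extend(path, seen):
        # Try to grow the current simple path until it has k vertices.
        if len(path) == k:
            return True
        return any(extend(path + [w], seen | {w})
                   for w in adj[path[-1]] if w not in seen)

    return any(extend([v], {v}) for v in vertices)


def make_vertex_inclusion_oracle(edges, k, decide=has_k_path):
    """Return includes(Y): does the subgraph induced by Y contain a witness?"""
    def includes(Y):
        Y = set(Y)
        # Self-reduction: keep only the edges with both endpoints in Y.
        induced = [(u, v) for u, v in edges if u in Y and v in Y]
        return decide(Y, induced, k)
    return includes
```

Any decision routine with the same signature could be passed as `decide`; the wrapper itself only performs the restriction from U to Y.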

In the oracle setting there are at least two natural ways to measure the efficiency of witness 
extraction. 

Number of oracle queries. This measure has been extensively studied in the domain of
combinatorial group testing [9], where the canonical task is to identify k defective items
from a population of n items, with the objective of minimizing the number of tests¹
(oracle queries) required to identify all the defectives. This measure does not accurately
reflect the amount of computing resources invested in our context, because different
oracle queries in general do not use the same amount of resources. Nevertheless, the group
testing perspective enables information-theoretic lower bounds and supplies useful algorithmic
techniques for extraction.

Total running time. Assuming we have bounds on the running time of the oracle as a
function of n and k, we can bound the running time of extraction of witnesses by taking 
the sum of the running times of the oracle queries. It turns out that we get fair control 
over the total running time already if we know that the running time of the oracle scales 
at least linearly in n. 

The objectives of this paper are threefold. (a) First, we draw from techniques in classical
group testing to arrive at efficient witness extraction algorithms for inclusion oracles both in
deterministic and in randomized settings with one-sided error. (b) Second, we show examples
of parameterized problems which can be solved efficiently in practice by a combination of
an FPT decision oracle and a group-testing algorithm; in particular, for the $k$-path problem
our experimental results show that one can find a 14-vertex witness in a 2000-vertex graph
within a minute on a typical laptop. (c) Third, we discuss some non-obvious choices we made
during the implementation, namely the choice of the $GF(2^q)$ arithmetic implementation; we
believe our findings might be useful for implementations of other algorithms applying $GF(2^q)$
arithmetic.

¹ In the setting of classical group testing, a single test on a set of items determines whether the set contains
at least one defective item.

To set up a trivial baseline for performance comparisons, it is not difficult to see that
$O(n)$ queries to an inclusion oracle suffice to extract a witness: simply delete points from the
universe one by one, with each deletion followed by an oracle query on the remaining points.
If the oracle answers NO, we know the deleted point was essential and insert it back. When
the process finishes, the points that remain form a witness. This, however, is not particularly
efficient since each oracle query costs at least $O(f(k) \cdot n)$ time, raising the total running time
to $O(f(k) \cdot n^2)$ and making the approach impractical for large n.
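The baseline can be sketched in a few lines of Python. The sketch assumes a deterministic oracle passed as a callable `includes`, and assumes that the full universe contains at least one witness; the helper `make_toy_oracle`, which answers queries from an explicit list of witnesses, is hypothetical and serves only to make the example runnable.

```python
def make_toy_oracle(witnesses):
    """Hypothetical deterministic inclusion oracle over an explicit witness family."""
    def includes(Y):
        Y = set(Y)
        return any(w <= Y for w in witnesses)
    return includes


def extract_naive(universe, includes):
    """Baseline extraction: one oracle query per element of the universe."""
    remaining = set(universe)
    for x in list(universe):
        remaining.discard(x)           # tentatively delete x
        if not includes(remaining):    # NO: x was essential
            remaining.add(x)           # insert it back
    return remaining
```

The loop issues exactly n queries, each on a set of size close to n in the worst case, which is the source of the quadratic dependence on n noted above.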

Our Results on Extraction. We begin by transporting techniques from group testing
[9] to arrive at more efficient witness extraction. Our first contribution merely amounts to
observing that the so-called bisecting algorithm [8] can be translated to work with an inclusion
oracle and in the presence of one or more witnesses. We also observe that taking into account
the total running time of the algorithm, the baseline cost of a factor $O(n)$ in running time can
be lowered to $O(k)$ if the running time of the oracle is at least linear in n, which is the case in
most applications. These observations are summarized in Theorem 1.

Let $\mathcal{F}$ be a nonempty family of witnesses, each of size at most k, over an n-element universe,
$n, k \ge 1$. We say that a function $g : \mathbb{N} \to \mathbb{N}$ is at least linear if for all $n_1, n_2 \in \mathbb{N}$ it holds that
$g(n_1) + g(n_2) \le g(n_1 + n_2)$.


Theorem 1 (Deterministic Extraction). There exists an algorithm that extracts a witness in
$\mathcal{F}$ without knowledge of k using at most

$$Q(n, k) = 2k \left( \log_2 \frac{n}{k} + 2 \right)$$

queries to a deterministic inclusion oracle. Moreover, suppose the oracle runs in time $T(n, k) =
O(f(k) g(n))$ for a function g that is at least linear. Then, there exists an algorithm that
extracts a witness in $\mathcal{F}$ in time $O(k \cdot T(2n, k)) = O(f(k) \cdot k \cdot g(2n))$.

Currently the fastest known parameterized algorithms in many cases use randomization.
Thus in practice one must be able to cope with decision oracles that may give erroneous
answers. Typically, the decision algorithm produces false negatives with at most some small
probability, but false positives do not occur.

Let us assume that the probability of a false negative is $p \le 1/4$. Beyond the absence of false
positives, a further observation to our advantage is that typically witnesses may be checked
deterministically, and essentially at no computational cost compared with the execution of
even one oracle query. That is, we have available a subroutine that takes a candidate witness
$W \subseteq U$ as input and returns whether $W \in \mathcal{F}$. We make this assumption in what follows.
Thus having access to a randomized inclusion oracle enables deterministic extraction, but with
randomized running time. These observations are summarized in Theorem 2.

Theorem 2 (Las Vegas Extraction). There exists an algorithm that extracts a witness in
$\mathcal{F}$ without knowledge of k using in expectation at most $O(k \log n)$ queries to a randomized
inclusion oracle that has no false positives but may output a false negative with probability at
most $p \le 1/4$. Moreover, suppose the oracle runs in time $T(n, k) = O(f(k) g(n))$ for a function
g that is at least linear. Then, there exists an algorithm that extracts a witness in $\mathcal{F}$ in time
$O(k \cdot T(2n, k) + (k \log k) \cdot T(2k, k))$.

An Application: $k$-Path. The $k$-path problem is one of the basic NP-complete problems,
a natural parameterized version of the Hamiltonian Path problem. In this problem we are
given an undirected connected graph G = (V, E) and a natural number k. The goal is
to find a simple path on k vertices in G. Denote n = |V| and m = |E|. In terms of
dependence on k, the currently fastest algorithm is due to Björklund, Husfeldt, Kaski, and
Koivisto [1] and can be tuned to run in $1.66^k k^{O(1)} m$ time. It uses algebraic tools and only
solves the corresponding decision problem. We applied a simplified version of this algorithm,
slightly easier to implement, which runs in $O(2^k k m)$ time, assuming that finite field arithmetic
operations take constant time (cf. [6]). The algorithm evaluates a certain polynomial of degree
$d = 2k - 1$ over the finite field $GF(2^q)$, which turns out to be a generating function of all
witnesses. The algorithm is randomized, and it may return a false negative. The failure
probability is bounded by $d/2^q$, hence by choosing q large enough we can assume it is at most
$1/4$, as required by Theorem 2.

Our universe U is the set of edges of the input graph and we are extracting witnesses
with exactly $k - 1$ edges. By Theorem 2 we obtain an algorithm with expected running time
$O(2^k k^2 \cdot m)$ for witness extraction.

However, when we consider an actual implementation, the above approach should be refined
as follows. First set the universe U to be the set of vertices and find a set S of k vertices
which contains a $k$-vertex path. Next, set the universe U to be the set of edges in the induced
graph G[S], and find the witness. By Theorem 2, for dense graphs this can give a factor two
speed-up.

A Computational Biology Application: Graph Motif. In the graph motif problem
we are given an undirected connected graph G = (V, E), a vertex coloring $c : V \to C$, and
a multiset M of cardinality k consisting of colors in the set C. The goal is to find a subset
$S \subseteq V$ such that the induced subgraph G[S] is connected, and the multiset c(S) of colors of
the vertices of S is equal to M. Note that |S| = k. This problem has important applications
in querying patterns in protein-protein interaction (PPI) networks, see e.g. [4]. Although the
problem is NP-hard, in the data instances coming from this application the graph size is of
the order of thousands and the pattern is very small (according to [3] the number of edges is
21275 for fly and 28972 for human, and $k \in \{4, \ldots, 25\}$), i.e. they are perfectly suited for
FPT algorithms. A recent randomized decision algorithm [3] solves the corresponding decision
problem by evaluating a certain polynomial of degree $d = 3k - 1$ over the finite field $GF(2^q)$,
which turns out to be a generating function of all witnesses. Its running time is dominated by
performing $O(2^k k^2 \cdot m)$ arithmetic operations in $GF(2^q)$, where m is the number of edges in
the input graph. The algorithm returns false negatives with probability bounded by $d/2^q$, hence
by choosing q large enough we can assume it is at most $1/4$. It follows that we can use it as a
randomized oracle in the algorithm described in Theorem 2.

Our universe U is the set of vertices of the input graph. By Theorem 2 we obtain an
algorithm with expected running time $O(2^k k^3 \cdot m)$ for witness extraction, assuming that finite
field arithmetic operations take constant time.

Further applications. The list of problems which (a) fall into our witness extraction
framework and (b) have the property that the asymptotically fastest decision algorithms do not
return a witness includes Steiner tree, $q$-set packing [1], $d$-dimensional matching [1], Steiner
cycle (aka $K$-cycle), and the directed rural postman problem.

Related and Previous Work. The relations between the time complexity of decision
problems and their search versions were studied by Fellows and Langston.

Independently of our work, Hassidim, Keller, Lewenstein, and Roditty [14] presented
a randomized algorithm that extracts a witness for the (weighted) $k$-path problem using
$O(k \log n)$ calls to a decision oracle, in expectation. Their approach is to discard random
subsets (of size n/k) of the vertex set as long as the resulting instance still contains the solution.
The bisecting algorithm [8] that we extend in this paper can be seen as a cleaner version of
this idea. First, in the bisecting algorithm larger sets get discarded. Second, the bisecting
algorithm is deterministic. Hassidim et al. do not analyze how the time of their algorithm
is influenced by the fact that the oracle is randomized. From an asymptotic perspective
this is not needed because one can repeat each oracle call multiple times to reduce the error
probability below an arbitrary threshold. However, in practice this is an unnecessary (though
only constant-factor) slow-down, which we seek to avoid in what follows.

Figure 1: Running times of various algorithms for a graph with exactly one witness (upper
charts) and $\Omega(n^2)$ witnesses (lower charts). Each running time on the graph is the median of 5
runs for the same input instance. The left charts: a 1000-vertex graph and $k \in \{6, 7, \ldots, 18\}$.
The right charts: k = 14 (upper) or k = 15 (lower) and the number of vertices varies. Running
times on a 2.53-GHz Intel Xeon CPU.

Implementation and Experiments. We implemented in C the $O(2^k k m)$-time decision
algorithm for the $k$-path problem and the algorithm from Theorem 2, which we call ‘fifo’
on the charts. The crucial part of the implementation of the decision oracle is the finite field
arithmetic. Somewhat unexpectedly, we found that to optimize the running time, a different
method should be chosen depending on whether we use the oracle just once (e.g. check whether
there is a witness) or whether it is used in combination with the algorithm from Theorem 2 to
find a witness. Details can be found in Section 4.

We ran a series of experiments on a single 2.53-GHz Intel Xeon CPU. We compare the fifo
algorithm with two other natural candidates. The first is the witness extraction algorithm
of Hassidim et al. [14] combined with the $O(2^k k m)$-time inclusion oracle, called ‘HKLR’ on
the charts. The second is the $O(4^k k^{2.7} m)$-time algorithm of Chen et al. [5], called
‘Divide-and-Color’. It is not based on algebraic tools and finds the witness while solving the decision
problem. Note that there are many more algorithms and heuristics for the $k$-path problem which
would be much faster on particular instances. A natural heuristic is computing the DFS
tree. If the tree has depth at least k the witness is found and otherwise the graph has
pathwidth at most k. On the other hand, when the pathwidth p is very small (say, $p < k/2$), the
$(2 + \sqrt{2})^p n^{O(1)}$ algorithm of Cygan et al. should be fast. However, in this work we want to
focus on algorithms with the best guarantees in the worst case. Disregarding the detailed memory
layout of the input graph, all three algorithms we compare are oblivious to the topology
of the graph apart from the parameters m and k. In our experiments we use two types of
trees with $m = n - 1$ as the input graphs. The first type (with a unique witness) consists of
$\lfloor (k-1)/2 \rfloor$-vertex paths joined at a common endvertex; when k is odd two of the paths are
extended by an edge, when k is even one path is extended by two edges and one path by one
edge. The second type (with $\Omega(n^2)$ witnesses) has k odd and all paths are extended by an
edge.

The results can be seen in Fig. 1. We see that both fifo and HKLR are much faster than
Divide-and-Color even for very small values of k. For 1000-vertex graphs our algorithm fifo
finds ($\le 10$)-vertex patterns in under 1 second and ($\le 20$)-vertex patterns in under 1 hour. HKLR
is considerably slower and the difference is more visible when there are many witnesses.

We have also implemented the $O(2^k k^2 \cdot m)$-time decision algorithm [3] for the graph motif
problem, plugged into the fifo extraction algorithm from the present paper. Fig. 2 shows the
running times of our implementation. The size of the input instance is typical for the
applications in protein-protein interaction networks. As in the case of $k$-path, the
running time of the decision algorithm is essentially the same regardless of the structure of
the graph and the motif, so we just used a random input graph and a random motif.

Figure 2: Running times of witness extraction for the Graph Motif problem. Each running time
on the graph is the median of 5 runs for the same input instance. The left chart: an 8000-vertex
32000-edge graph and $k \in \{6, \ldots, 14\}$ (the size of the graph is roughly the same as the PPI
network of human). The right chart: k = 14 and the number of vertices varies from 100 to
10000 (m = 4n). (Note that both axes use logarithmic scale.)

2 Extracting a Witness Using a Deterministic Oracle

The objective of this section is to prove Theorem 1. Accordingly, we assume we have available
a deterministic inclusion oracle. Our strategy is to translate an existing algorithm developed
for group testing into the setting of witness extraction (Algorithm 1 and Lemma 3), and then
analyze its performance with respect to the total running time, including the oracle queries
(Lemma 4).

Let us first review the setting of classical group testing, and then indicate how to translate
classical algorithms to the setting of witness extraction. In group testing, we do not have
a family of witnesses, but rather a single unknown set $D \subseteq U$ consisting of defective items.
Furthermore, instead of an inclusion oracle (that would test whether $D \subseteq Y$ for a query Y)
we have an intersection oracle that answers whether $D \cap Y \neq \emptyset$ for a query Y. That is, a
query tells us whether the query set Y has at least one defective item.

Characteristic to classical group testing algorithms is that they proceed to shrink down the 
size of the universe U while maintaining the invariant $D \subseteq U$ until D has been identified (that
is, D = U). Indeed, whenever the (intersection) oracle answers NO, we know that the query 
Y is disjoint from D , and thus can safely delete all points in Y from U without violating the 
invariant. 

In our setting we have to work with an inclusion oracle and cope with the possibility of
the family $\mathcal{F}$ containing more than one witness. Fortunately, it turns out that the setting
is not substantially different from group testing. Indeed, in analogy with group testing, we
will also proceed to narrow down the universe U but seek to maintain a slightly different
invariant, namely “there exists a $W \in \mathcal{F}$ such that $W \subseteq U$”. In this setting we can narrow
down the universe by the following basic procedure: for a subset $A \subseteq U$ we query the inclusion
oracle with $Y = U \setminus A$. If the answer is YES, we know that we can safely remove A from U
while maintaining the invariant. This basic analogy enables one to transport group testing
algorithms into the setting of witness extraction.

In what follows we focus on a translation of one such algorithm, the bisecting algorithm [8].
One of its advantages is that it does not need to know the number of defective items in
advance, and hence in particular it is suitable for our applications where we want to allow the
witnesses to potentially differ in size. Moreover, this particular algorithm is convenient in our
further modifications for the randomized oracle model (Sect. 3). We give the pseudocode of a
“witness extraction” version of the bisecting algorithm as Algorithm 1.

The correctness of Algorithm 1 follows from the fact that our invariant “there exists a
$W \in \mathcal{F}$ such that $W \subseteq U$” is always satisfied. We remark that Algorithm 1 has a further
minor difference with the original bisection algorithm in that whenever it partitions a set A
into $A_1$ and $A_2$ then $A_1$ and $A_2$ are almost of the same size ($||A_1| - |A_2|| \le 1$), whereas in the
original algorithm $|A_1| = 2^{\lceil \log_2 |A| \rceil - 1}$ and $|A_2| = |A| - |A_1|$. Du and Hwang [8] showed that the
bisection algorithm performs $O(k \log \frac{n}{k})$ queries. Below we present a self-contained analysis.

Lemma 3. Algorithm 1 makes at most $2k(\log_2 \frac{n}{k} + 2)$ oracle queries.

Proof. We can model the execution of Algorithm 1 with a tree T whose nodes are the subsets
A that have appeared in the queue Q during execution. A node A is a child of node B if and
only if A was obtained by bisecting B. In particular T is a binary tree with at most k leaves
and two types of internal nodes: the partition nodes with two children correspond to splitting
a set into two halves, and the cut nodes with one child correspond to cutting off a half of a set.
Each internal node in T is associated with 1 or 2 queries.

ALGORITHM 1: ExtractInclusion($U$)

Initialize an empty FIFO queue $Q$;
Let $W \leftarrow \emptyset$;
Insert $U$ into $Q$;
while $Q$ is not empty do
    Remove the first set $A$ from $Q$;
    if $|A| = 1$ then
        Let $W \leftarrow W \cup A$;
    else
        Partition $A$ into $A_1$ and $A_2$ arbitrarily so that $||A_1| - |A_2|| \le 1$;
        if Includes($U \setminus A_1$) then
            Let $U \leftarrow U \setminus A_1$;
            Insert $A_2$ into $Q$;
        else
            if Includes($U \setminus A_2$) then
                Let $U \leftarrow U \setminus A_2$;
                Insert $A_1$ into $Q$;
            else
                Insert both $A_1$ and $A_2$ into $Q$;
            end
        end
    end
end
return $W$
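For concreteness, the following minimal Python sketch mirrors Algorithm 1 step by step. It assumes a nonempty universe of sortable elements, nonempty witnesses, and a deterministic oracle passed as a callable `includes`; the helper `make_toy_oracle`, which answers from an explicit witness list and counts queries, is hypothetical and not part of the paper's C implementation.

```python
from collections import deque


def make_toy_oracle(witnesses):
    """Hypothetical deterministic inclusion oracle that also counts its queries."""
    stats = {"queries": 0}

    def includes(Y):
        stats["queries"] += 1
        Y = set(Y)
        return any(w <= Y for w in witnesses)

    return includes, stats


def extract_inclusion(universe, includes):
    """Bisecting witness extraction following Algorithm 1."""
    U = set(universe)
    W = set()
    queue = deque([sorted(U)])          # FIFO queue of candidate sets
    while queue:
        A = queue.popleft()
        if len(A) == 1:
            W.update(A)                 # a singleton in the queue is essential
            continue
        half = len(A) // 2
        A1, A2 = A[:half], A[half:]     # sizes differ by at most one
        if includes(U - set(A1)):
            U -= set(A1)                # A1 can be discarded
            queue.append(A2)
        elif includes(U - set(A2)):
            U -= set(A2)                # A2 can be discarded
            queue.append(A1)
        else:
            queue.append(A1)            # both halves are needed
            queue.append(A2)
    return W
```

When several witnesses exist, which one is returned depends on the arbitrary choice of partition; here the partition is fixed by sorting so that runs are reproducible.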


Let us order T arbitrarily so that every partition node has a left child and a right child; let
us furthermore call the only child of a cut node the left child. For every leaf v form a path $P_v$
up in the tree by first including v into the path and including each subsequent node into $P_v$
as long as we arrived into the node from the left child of the node. Such paths $P_v$ clearly form
a partition of the nodes in T.

For every cut node x, let $D_x$ denote the subset of elements that was discarded. For a leaf v
let $S_v$ denote the union of all the sets $D_x$ on path $P_v$. For any cut nodes x and y on $P_v$, if
x is an ancestor of y then $|D_x| \ge 2|D_y| - 1$. It follows that there are at most $\lceil \log_2 |S_v| \rceil$ cut
nodes on $P_v$. Hence the total number of cut nodes is at most

$$\sum_{v} \lceil \log_2 |S_v| \rceil \le k \left( \log_2 \frac{n}{k} + 1 \right),$$

where the sum is over the at most k leaves v in T and the inequality follows from Jensen’s
inequality (and the fact that the sets $S_v$ form a partition of $U \setminus W$, where W is the returned
witness). Since T is a binary tree, the number of partition nodes is at most $k - 1$. Thus there
are at most $k(\log_2 \frac{n}{k} + 2)$ internal nodes and at most $2k(\log_2 \frac{n}{k} + 2)$ queries. □

A routine information-theoretic argument shows that Lemma 3 is optimal up to constants,
that is, at least $\log_2 \binom{n}{k} \ge k \log_2 \frac{n}{k}$ queries (bits of information) are needed to identify a
unique witness of size k in a universe of size n. This observation can be strengthened to the
randomized setting via the Yao principle: in expectation, $\Omega(k \log \frac{n}{k})$ queries are required.
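To get a feel for the gap between the two bounds, the short Python sketch below evaluates the upper bound of Lemma 3 and the information-theoretic lower bound for given n and k. The function names are hypothetical; the formulas are the ones stated in the text.

```python
from math import comb, log2


def query_upper_bound(n, k):
    """Upper bound of Lemma 3: 2k(log2(n/k) + 2) oracle queries."""
    return 2 * k * (log2(n / k) + 2)


def query_lower_bound(n, k):
    """Information-theoretic lower bound: log2 of the number of k-subsets."""
    return log2(comb(n, k))


if __name__ == "__main__":
    # Example parameters matching the 14-vertex, 2000-vertex experiment.
    n, k = 2000, 14
    print(query_lower_bound(n, k), query_upper_bound(n, k))
```

Since $\binom{n}{k} \ge (n/k)^k$, the lower bound is always at least $k \log_2 \frac{n}{k}$, so the two quantities differ by a constant factor plus a term linear in k.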

We now proceed to analyze Algorithm 1 with a more natural complexity measure, namely
the total time of the extraction procedure, taking into account the time used by the oracle
queries. Recall that a function $g : \mathbb{N} \to \mathbb{N}$ is at least linear if for all $n_1, n_2 \in \mathbb{N}$ we have
$g(n_1) + g(n_2) \le g(n_1 + n_2)$.

Lemma 4. Suppose the time complexity of the inclusion oracle on a query set of size n is
$T(n, k) = O(f(k) g(n))$, where g is at least linear. Then, the running time of Algorithm 1 is
$O(k \cdot T(2n, k))$.


Proof. We follow the notation introduced in the proof of Lemma 3. Because there are at
most $k - 1$ partition nodes, the total time spent at these nodes is $O(k \cdot T(n, k))$. Hence it
remains to analyze the time spent at the cut nodes. It suffices to show that for every leaf v of
the tree T the total time spent at the cut nodes in path $P_v$ is $O(T(2n, k))$. Observe that at
every cut node the size of the universe decreases by a factor of 2. Hence this time is at most
$T(n, k) + T(n/2, k) + T(n/4, k) + \ldots + T(1, k) \le T(2n, k)$, where the last inequality uses the
assumption that g is at least linear. □
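The last inequality can be spelled out as follows. This is a sketch under simplifying assumptions that are not stated in the text: n is a power of two, g is nondecreasing, and $T(n, k) \le c \, f(k) \, g(n)$ for a constant c, with the bound read in terms of $c \, f(k) \, g(\cdot)$.

```latex
\begin{align*}
\sum_{i=0}^{\log_2 n} T\!\left(\frac{n}{2^i}, k\right)
  &\le c\, f(k) \sum_{i=0}^{\log_2 n} g\!\left(\frac{n}{2^i}\right) \\
  &\le c\, f(k)\, g\!\left(\sum_{i=0}^{\log_2 n} \frac{n}{2^i}\right)
     && \text{($g$ is at least linear)} \\
  &=   c\, f(k)\, g(2n - 1) \\
  &\le c\, f(k)\, g(2n)
     && \text{($g$ is nondecreasing)}.
\end{align*}
```

The second step applies the defining inequality $g(n_1) + g(n_2) \le g(n_1 + n_2)$ repeatedly to merge the summands.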

Lemma 3 and Lemma 4 now establish Theorem 1.


3 Extracting a Witness Using a Randomized Oracle 

The objective of this section is to prove Theorem 2. Accordingly, we assume we have available
a randomized inclusion oracle that has no false positives but may output a false negative with
probability at most $p \le 1/4$. The outcomes of queries are assumed to be mutually independent
as random events.

We start with two simple observations regarding Algorithm 1 in the context of a randomized
oracle. First, since the oracle does not have false positives, the set W output by Algorithm 1
is always a superset of a witness. Second, by Theorem 1 we know that the algorithm makes
at most $Q(n, k)$ queries to extract a witness in the event no false negatives occur in the first
$Q(n, k)$ queries. By the union bound, the probability of this event is at least $1 - p\,Q(n, k)$.
This gives us a Monte Carlo algorithm that fails with probability at most $p\,Q(n, k)$.

Recalling that we assume we have access to a subroutine that checks whether a given set
$W \subseteq U$ satisfies $W \in \mathcal{F}$, we would clearly like to transform the Monte Carlo algorithm into a
Las Vegas algorithm that always extracts a witness, so that the cost of randomization is only
paid in terms of the running time.

The Las Vegas algorithm now operates in two stages. Let us call this algorithm Algorithm 2.
In the first stage, we simply run Algorithm 1 and obtain a set W as output. In the second
stage, we insert each element of W into an empty queue Q. Next, as long as W is not a
witness, we (1) remove an element e from the head of Q, (2) if Includes($W \setminus \{e\}$) returns NO
then we insert e at the tail of Q and otherwise we remove e from W. Finally, we return W.
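The two stages can be sketched in Python as follows. The sketch assumes a callable `includes` with no false positives, a deterministic checker `is_witness`, and a nonempty universe of sortable elements; `make_noisy_oracle`, which simulates false negatives with probability p over an explicit witness list, is a hypothetical test harness rather than a real decision algorithm.

```python
import random
from collections import deque


def make_noisy_oracle(witnesses, p, rng):
    """Hypothetical oracle: never a false positive, false negative w.p. p."""
    def includes(Y):
        Y = set(Y)
        truth = any(w <= Y for w in witnesses)
        return truth and rng.random() >= p
    return includes


def bisect_stage(universe, includes):
    """Stage one: Algorithm 1; returns a superset of some witness."""
    U, W = set(universe), set()
    queue = deque([sorted(U)])
    while queue:
        A = queue.popleft()
        if len(A) == 1:
            W.update(A)
            continue
        half = len(A) // 2
        A1, A2 = A[:half], A[half:]
        if includes(U - set(A1)):
            U -= set(A1)
            queue.append(A2)
        elif includes(U - set(A2)):
            U -= set(A2)
            queue.append(A1)
        else:
            queue.extend([A1, A2])
    return W


def extract_las_vegas(universe, includes, is_witness):
    """Algorithm 2: always returns a witness; only the running time is random."""
    W = bisect_stage(universe, includes)
    queue = deque(sorted(W))
    while not is_witness(W):
        e = queue.popleft()
        if includes(W - {e}):
            W.discard(e)        # YES is always correct, so e is redundant
        else:
            queue.append(e)     # NO may be a false negative; retry e later
    return W
```

Because a YES answer is always trustworthy, an element is removed only when the remainder still contains a witness, so the loop can end only at a set accepted by `is_witness`.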

Given that only false negatives may occur, Algorithm 2 is obviously correct and always
returns a witness. It remains to analyze the expected number of queries and the expected
running time of Algorithm 2.

Lemma 5. Algorithm 2 makes in expectation $O(k \log n)$ queries to the randomized inclusion
oracle.

Proof. First we bound the expected number of queries in the first stage. Recall the tree model
of the execution of Algorithm 1 in the proof of Lemma 3. Let us study the model in the
presence of false negatives. A false negative in the first Includes query of Algorithm 1 (on
$U \setminus A_1$) causes the algorithm to view the set $A_1$ as necessary and continue processing it even
if it could in fact be dropped. Similarly, a false negative in the second Includes query (on
$U \setminus A_2$) causes the algorithm to view $A_2$ as necessary. In
particular, each false negative gets inserted into the queue Q and hence into the tree T.

Now let us study an arbitrary subtree of T rooted at a false negative node. We observe
that all such nodes either remain false negative nodes, or become exhausted as YES nodes
or singleton nodes. (That is, no node in the subtree is a true negative.) Let us study the
process that creates such a subtree and for convenience ignore the possibility of singleton
nodes exhausting the process. Let X be the random variable that tracks the size of the subtree.
Because the left and right child nodes of each node are independently false negatives with
probability p, we observe that the expectation of X satisfies $E[X] = 1 + 2p\,E[X]$. That is,
$E[X] = 1/(1 - 2p)$. Because $p \le 1/4$, we have $E[X] \le 2$. Since each false negative has to interact
with true negative and positive nodes, the expected number of queries in the first stage is, by
linearity of expectation, at most $3Q(n, k)$ by Lemma 3.

Let $W_0$ denote W at the beginning of the second stage. For purposes of analysis we divide
the second stage into two sub-stages. The first sub-stage finishes when $|W| \le 2k$. Assume
that there was at least one query in the first sub-stage, that is, $|W_0| > 2k$. Let Z be the
total number of queries in the first sub-stage. Then $Z = Z_1 + Z_2 + Z_3$ where $Z_1$ is the
number of false negative queries, $Z_2$ is the number of positive queries and $Z_3$ is the number
of true negative queries. First observe that $Z_1$ has the negative binomial distribution, that
is, $Z_1 \sim \mathrm{NB}(|W_0| - 2k, p)$, and hence $E[Z_1 \mid |W_0|] = (|W_0| - 2k)\frac{p}{1-p} \le |W_0| - 2k$. It follows
that $E[Z_1] \le E[|W_0|] - 2k \le 3Q(n, k)$. Now note that $Z_2$ is bounded by $|W_0|$, which is
bounded by the number of queries in the first stage, so $E[Z_2] \le 3Q(n, k)$. Call an element e
of W false if $W \setminus \{e\}$ contains a witness and true otherwise. Since there are at most k true
elements, as long as $|W| > 2k$ the number of true elements is bounded by the number of false
elements (if W contains more than one witness then all elements of W may be false). If $e \in W$
is a true element then the query $W \setminus \{e\}$ always returns NO (a true negative); if e is false then
the query $W \setminus \{e\}$ may return either YES (a true positive) or NO (a false negative). Since
elements of W are tested in queue order, $Z_3 \le Z_1 + Z_2$ and hence $E[Z_3] \le 6Q(n, k)$.

Finally consider the second sub-stage, when $|W| \le 2k$. Let t be the number of false
elements in W, $t \le 2k$. The algorithm iterates through the queue until there is no false
element in W. The number of times we iterate over the whole queue is the maximum of t
independent random variables, each of geometric distribution with success probability $1 - p$,
which by $p \le 1/4$ is in expectation at most $1 + H_t / \ln(1/p) \le 2H_{2k} \le 3 \ln 2k$ (cf. [10]). Since in
each iteration the algorithm performs at most 2k queries, the expected number of queries in
the second sub-stage is then at most $6k \ln 2k$.
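The bound on the expected maximum can be checked numerically. The Python sketch below computes the expectation exactly from the tail-sum formula $E[\max] = \sum_{j \ge 0} \bigl(1 - (1 - p^j)^t\bigr)$, which holds because a geometric variable with success probability $1 - p$ exceeds j with probability $p^j$, and compares it with $1 + H_t / \ln(1/p)$. The function names are hypothetical.

```python
from math import log


def harmonic(t):
    """t-th harmonic number H_t."""
    return sum(1.0 / i for i in range(1, t + 1))


def expected_max_geometric(t, p, tol=1e-15):
    """Expected maximum of t independent geometric variables (support 1, 2, ...),
    each with success probability 1 - p, via the tail-sum formula."""
    total, j = 0.0, 0
    while True:
        term = 1.0 - (1.0 - p ** j) ** t   # P(max > j)
        if term < tol:
            return total
        total += term
        j += 1


def max_geometric_bound(t, p):
    """The bound 1 + H_t / ln(1/p) used in the proof."""
    return 1.0 + harmonic(t) / log(1.0 / p)
```

For t = 1 the expectation reduces to the familiar $1/(1 - p)$, which gives a quick sanity check of the formula.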

The expected number of queries is thus at most $15Q(n, k) + 6k \ln 2k$. □

Theorem 2 is now established by Lemma 5 and the following lemma.

Lemma 6. Suppose the time complexity of the randomized inclusion oracle on a query set
of size n is $T(n, k) = O(f(k) g(n))$, where g is at least linear. Then, the running time of
Algorithm 2 is $O(k \cdot T(2n, k) + (k \log k) \cdot T(2k, k))$.

Proof. By Lemma 4 the total time of the queries in the first stage that returned a correct
answer is bounded by $O(k \cdot T(2n, k))$.

As argued in the proof of Lemma 5, all nodes corresponding to false negative queries in
both stages form $O(1)$-sized subtrees of tree T. For every such subtree the parent p of the root
of the subtree corresponds to a query with a correct answer. Moreover the size of the instance
passed to the oracle in every call in the subtree is bounded by the size of the instance passed
to the oracle in the query corresponding to p. Hence the expected total time spent at the
subtree is asymptotically the same as the time spent at p. It follows that all the false negative
queries take $O(k \cdot T(2n, k))$ time in expectation. In particular we showed that the first phase
takes $O(k \cdot T(2n, k))$ expected time.

[Figure 3 panels. Left: k = 16, decision algorithm. Right: k = 12, witness extraction.]

Figure 3: Comparison of three implementations of $GF(2^q)$ arithmetic. Left: (single run of)
$k$-path decision oracle for instances with no solution. The pattern size is fixed as k = 16
and the number of vertices n varies. Right: fifo algorithm using $k$-path decision oracle for
instances with exactly one solution (each running time on the graph is the median of 5 runs
for the same input instance). The pattern size is fixed as k = 12 and the number of vertices n
varies. Running times on a 2.53-GHz Intel Xeon CPU.

For every positive query Includes($W \setminus \{e\}$) in the second stage there is a corresponding
(false negative) leaf for the singleton $\{e\}$ in tree T. Hence the total time of
positive queries in the second stage is bounded by the time of the first stage.

Now we focus on true negative queries in the first sub-stage of the second stage. Consider
a single pass of the algorithm through all the elements in the queue. Within this pass there
are at most $k$ true negatives (since witnesses are of size at most $k$). Moreover, since in this
sub-stage there are at least $2k$ elements in the queue, we can injectively assign to each of
the true negative queries a false negative or a positive query for an instance of the same or
larger size. Hence, the total time of true negative queries in the first sub-stage of the second
stage is bounded by the total time of false negative and positive queries, which we already
bounded by $O(k \cdot T(2n, k))$.

We are left with bounding the time of true negatives in the second sub-stage of the second
stage. However, since then $|W| < 2k$, each query takes just $T(2k, k)$ time. In the proof of
Lemma [5] we showed that the total number of queries in the second sub-stage is $O(k \log k)$, so
the desired bound follows. □

4 Implementation of Finite Field Arithmetic 

The most critical subroutines of the k-path inclusion oracle we implemented are the operations
of addition and multiplication in a finite field GF($2^q$). The choice of $q$ is important: the
oracle returns a false negative with probability at most $(2k-1)/2^q$. We can assume that $k \le 30$, for
otherwise the oracle runs too long. It follows that to guarantee a low error probability, say, at
most $2^{-20}$, it suffices to pick $q = 26$.
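The arithmetic behind this choice of $q$ can be checked mechanically. The sketch below assumes the false-negative bound $(2k-1)/2^q$ and searches for the smallest $q$ that pushes it below $2^{-b}$; the function name `min_field_bits` and its parameters are made up for this illustration.

```c
#include <stdint.h>

/* Smallest q such that (2k - 1) / 2^q <= 2^(-error_bits).
   Assumes k >= 1 and the false-negative bound (2k - 1) / 2^q. */
unsigned min_field_bits(unsigned k, unsigned error_bits)
{
    uint64_t numerator = 2ull * k - 1;
    unsigned q = error_bits;
    /* The condition is equivalent to 2^(q - error_bits) >= 2k - 1. */
    while ((1ull << (q - error_bits)) < numerator)
        q++;
    return q;
}
```

With $k = 30$ and a target of $2^{-20}$ this yields $q = 26$; with $k = 16$ and a target of $1/4$ it yields $q = 7$.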

Let us recall that elements of GF($2^q$) correspond to polynomials of degree at most $q - 1$
with coefficients from GF(2). Such a polynomial is conveniently represented as a $q$-bit binary
number. Addition in GF($2^q$) corresponds to addition of two polynomials, that is, the
symmetric difference (xor) of the binary representations. Multiplication is performed by (a)
multiplying the polynomials and (b) returning the remainder of the division of the result by a
primitive degree-$q$ polynomial; this is easily implemented in $O(q)$ word operations. We refer
to this implementation as ‘naive’.
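Steps (a) and (b) can be sketched in C as follows. This is a minimal illustration, not the implementation used in the experiments: the names `gf2q_add` and `gf2q_mul_naive`, the restriction $q \le 32$, and passing the reduction polynomial as a bit mask that includes its leading term are all assumptions made for the example.

```c
#include <stdint.h>

/* Addition in GF(2^q): xor of the bit representations. */
uint32_t gf2q_add(uint32_t a, uint32_t b)
{
    return a ^ b;
}

/* Multiplication in GF(2^q) for q <= 32 (illustrative sketch).
   `modulus` is the degree-q reduction polynomial as a bit mask,
   including the leading x^q term; a and b must be below 2^q. */
uint32_t gf2q_mul_naive(uint32_t a, uint32_t b, unsigned q, uint64_t modulus)
{
    uint64_t prod = 0;

    /* Step (a): polynomial (carry-less) product, degree at most 2q - 2. */
    for (unsigned i = 0; i < q; i++)
        if ((b >> i) & 1u)
            prod ^= (uint64_t)a << i;

    /* Step (b): remainder modulo the reduction polynomial,
       clearing coefficients from the top degree down to q. */
    for (int d = 2 * (int)q - 2; d >= (int)q; d--)
        if ((prod >> d) & 1u)
            prod ^= modulus << (d - (int)q);

    return (uint32_t)prod;
}
```

For instance, with $q = 8$ and the polynomial $x^8 + x^4 + x^3 + x + 1$ (mask `0x11B`; it is irreducible, which suffices for field arithmetic, although it is not primitive), the product of `0x57` and `0x83` is `0xC1`, the worked example from the AES specification.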

One can observe that step (a) above corresponds to carry-less multiplication of two binary
numbers, that is, the usual multiplication without generating carries (011 × 011 = 101). Such
multiplication of two 64-bit numbers is available as a single instruction (PCLMULQDQ) on a
number of modern Intel and AMD architectures. Using the fact that there is a primitive
polynomial of degree 64 with only 5 terms, step (b) can be implemented using bit shifts and xors [12].
We refer to this implementation as ‘clmul’.

The third natural option is to precompute the whole multiplication table (using the naive
algorithm) before running the oracle. This takes $4^q \lceil q/8 \rceil$ bytes of memory, so it can be considered
only for small values of $q$, say $q \le 12$ (even for $q = 12$ the precomputation time is negligible, at
substantially less than a second). We refer to this implementation as ‘lookup’.
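A table-based multiplication of this kind might look as follows for $q = 7$. The sketch is hypothetical: the names, the table layout (index `a * 2^q + b`) and the reduction polynomial $x^7 + x + 1$ are assumptions for illustration, since the polynomial used in the experiments is not stated here.

```c
#include <stddef.h>
#include <stdint.h>

#define LOOKUP_Q 7u                     /* assumed field GF(2^7) */
#define LOOKUP_SIZE (1u << LOOKUP_Q)    /* number of field elements */
#define LOOKUP_MODULUS 0x83u            /* x^7 + x + 1 (assumed polynomial) */

/* 4^q entries of ceil(q/8) = 1 byte each: 16 KiB for q = 7. */
static uint8_t mul_table[LOOKUP_SIZE * LOOKUP_SIZE];

/* Shift-and-xor multiplication, used only to fill the table. */
uint8_t slow_mul(uint8_t a, uint8_t b)
{
    unsigned acc = 0, aa = a;
    for (unsigned i = 0; i < LOOKUP_Q; i++) {
        if ((b >> i) & 1u)
            acc ^= aa;
        aa <<= 1;                       /* multiply aa by x */
        if (aa & LOOKUP_SIZE)
            aa ^= LOOKUP_MODULUS;       /* reduce when degree reaches q */
    }
    return (uint8_t)acc;
}

/* Precompute all products once, before the oracle runs. */
void build_mul_table(void)
{
    for (unsigned a = 0; a < LOOKUP_SIZE; a++)
        for (unsigned b = 0; b < LOOKUP_SIZE; b++)
            mul_table[(a << LOOKUP_Q) | b] = slow_mul((uint8_t)a, (uint8_t)b);
}

/* Multiplication is then a single memory access. */
uint8_t gf_mul(uint8_t a, uint8_t b)
{
    return mul_table[((size_t)a << LOOKUP_Q) | b];
}

size_t mul_table_bytes(void)
{
    return sizeof mul_table;
}
```

After one call to `build_mul_table()`, every product costs a single load, which is why the size of the table relative to the cache matters so much for this variant.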

The left chart of Fig. [3] shows the comparison of the three implementations of GF($2^q$)
arithmetic used in a single run of the decision oracle. For ‘naive’ we use $q = 26$ and for ‘clmul’
$q = 64$. For ‘lookup’ we use $q = 7$ because for smaller values of $q$ the running time is roughly
the same; however, since in the tests we look for a pattern of size 16, this gives only a bound
of $1/4$ on the error probability. To squeeze the probability down to $2^{-20}$ one can run the oracle
10 times and answer YES if any of the runs answers YES, since the oracle errs only by false
negatives. We see that although ‘lookup’ is faster
than ‘clmul’ when the oracle is called once, it is much slower when we repeat the oracle call
10 times (note also that ‘clmul’ provides error probability $2^{-59}$). The ‘naive’ method is slower
than the other two.
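Repeating a one-sided oracle can be sketched as below. Because the oracle errs only by false negatives, the combined answer is YES as soon as any run answers YES. The callback type `oracle_fn`, the wrapper `amplified_oracle` and the stand-in `toy_oracle` are hypothetical names used only for this illustration; they are not part of the implementation described here.

```c
#include <stdbool.h>

/* Hypothetical oracle interface: returns true for YES, false for NO.
   A YES is always correct; a NO may be a false negative. */
typedef bool (*oracle_fn)(void *instance);

/* Run the oracle up to `repetitions` times; answer YES if any run does. */
bool amplified_oracle(oracle_fn oracle, void *instance, unsigned repetitions)
{
    for (unsigned i = 0; i < repetitions; i++)
        if (oracle(instance))
            return true;
    return false;
}

/* Number of runs needed to reach error 2^(-target_bits) when a single
   run errs with probability at most 2^(-bits_per_run). */
unsigned repetitions_needed(unsigned bits_per_run, unsigned target_bits)
{
    return (target_bits + bits_per_run - 1) / bits_per_run;
}

/* Toy stand-in for a real oracle: answers NO a fixed number of times
   (the counter passed as `instance`), then YES. */
bool toy_oracle(void *instance)
{
    unsigned *misses_left = instance;
    if (*misses_left > 0) {
        (*misses_left)--;
        return false;
    }
    return true;
}
```

With a per-run error of $1/4 = 2^{-2}$ and a target of $2^{-20}$, `repetitions_needed(2, 20)` gives the 10 runs mentioned above.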

Note however that, if we aim at finding a witness, by Theorem [2] it suffices to guarantee
that the error probability is at most $1/4$, hence for $k \le 16$ we can pick $q = 7$. The advantage of
our witness extraction algorithm fifo is that even if it gets a false answer from the oracle, it
will discover the mistake later. Indeed, the right chart of Fig. [3] shows that using
GF($2^7$) with ‘lookup’ outperforms using GF($2^{64}$) with ‘clmul’, roughly by a factor of four.
The value $q = 7$ here is carefully chosen. On one hand, we want $q$ to be large to get a small
error probability for a single query and thus a small variance of the whole extraction running
time. On the other hand, on our machine the multiplication table for $q = 8$ does not fit into
the L1 cache (of size 32K), which results in an increase in the median running time. In the table below
(see also Fig. [4]) we show statistics for 200 runs of the extraction algorithm ($n = 1000$, $k = 12$)
using ‘lookup’ for $q = 5, 6, \ldots, 12$ and ‘clmul’ ($q = 64$).


logarithm of the field size       5      6      7      8      9     10     11     12   64 (clmul)
median [sec]                   4.58   4.38   4.39   4.69   6.15   7.30   9.55  15.94        15.92
maximum [sec]                 12.96   8.53   7.61   7.57   9.97  11.18   9.65  18.05        16.77
standard deviation [sec]       1.16   0.68   0.61   0.22   0.30   0.34   0.50   0.43         0.25


Clearly, for $q = 8, 9, \ldots, 12$ we get an increased number of cache misses (the 256K L2 cache
can fit the table only for $q \le 8$). The running times are concentrated very well around the
median for $q = 7, 8, 9$ (in the figure the first and third quartiles merge with the median).
For $q = 10, 11, 12$ we observe increasing variance. This is caused by the fact that the arithmetic
operations are performed on random numbers, thus the number of cache misses becomes a
random variable (and its expectation increases with $q$). For $q = 64$ and the clmul implementation
we get excellent concentration again because the error probability is very small (most likely
there was not a single error during the 200 runs) and in this method there are no cache misses.

Figure 4: Statistics for 200 runs of the extraction algorithm ($n = 1000$, $k = 12$) using the lookup
implementation for $q = 5, \ldots, 12$ and the clmul implementation ($q = 64$). The horizontal axis
shows time in seconds. The running time of each execution is visualized as a green circle. All
experiments for a given field size $2^q$ are summarized using a boxplot showing the first and third
quartiles and the mean (thick vertical line).
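The cache thresholds quoted in this section follow directly from the table size $4^q \lceil q/8 \rceil$ bytes. The small sketch below recomputes them; the function names and the explicit upper limit on $q$ are assumptions for illustration.

```c
#include <stddef.h>

/* Size in bytes of the full multiplication table for GF(2^q):
   4^q entries of ceil(q / 8) bytes each. */
size_t lookup_table_bytes(unsigned q)
{
    return ((size_t)1 << (2 * q)) * ((q + 7) / 8);
}

/* Largest q in 1..q_limit whose table fits into a cache of the given
   size, or 0 if none fits. */
unsigned max_q_fitting(size_t cache_bytes, unsigned q_limit)
{
    unsigned best = 0;
    for (unsigned q = 1; q <= q_limit; q++)
        if (lookup_table_bytes(q) <= cache_bytes)
            best = q;
    return best;
}
```

For a 32K L1 cache the largest fitting value is $q = 7$ (16K table), and for a 256K L2 cache it is $q = 8$ (64K table), matching the behaviour of the median running times in the table.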
A GF($2^{64}$) multiplication using the PCLMULQDQ instruction

Below we present the code for multiplication in GF($2^{64}$) using a single PCLMULQDQ instruction,
6 bitwise shifts and 7 bitwise xors. The algorithm is based on the work of Gueron and
Kounavis [12].

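#include <stdint.h>
#include <immintrin.h>  /* PCLMULQDQ intrinsics; on GCC/Clang compile with -mpclmul */

uint64_t gf2q_mul(uint64_t x, uint64_t y)
{
    uint64_t xy[2];
    xy[0] = x;
    xy[1] = y;
    __m128i ab = _mm_loadu_si128((__m128i*) xy);
    uint64_t X[2], tmp2, tmp3, tmp4;
    __m128i tmp1;

    /* Step (a): carry-less product of the high half (y) and the low half (x) of ab. */
    tmp1 = _mm_clmulepi64_si128(ab, ab, 0x01);
    _mm_storeu_si128((__m128i*) X, tmp1);

    /* Step (b): reduce the 128-bit product modulo x^64 + x^4 + x^3 + x + 1. */
    tmp2 = X[1];
    tmp3 = tmp2 >> 63;
    tmp4 = tmp2 >> 61;
    tmp3 = tmp3 ^ tmp4;
    tmp4 = tmp2 >> 60;
    tmp3 = tmp3 ^ tmp4;
    tmp2 = tmp3 ^ tmp2;
    tmp4 = tmp2 << 1;
    tmp3 = tmp2 ^ tmp4;
    tmp4 = tmp2 << 3;
    tmp3 = tmp3 ^ tmp4;
    tmp4 = tmp2 << 4;
    tmp3 = tmp3 ^ tmp4;
    return tmp3 ^ X[0];
}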
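On machines without PCLMULQDQ, the same shift-and-xor reduction can be cross-checked against a plain bit-serial multiplication. The following portable sketch is not part of the original code; all function names are made up for this example, and it assumes the reduction polynomial $x^{64} + x^4 + x^3 + x + 1$ implied by the shift amounts 1, 3 and 4 used in `gf2q_mul`.

```c
#include <stdint.h>

/* Low-order terms x^4 + x^3 + x + 1 of the assumed reduction polynomial. */
#define GF64_LOW_TERMS 0x1Bull

/* Reference: bit-serial multiplication in GF(2^64). */
uint64_t gf64_mul_bitserial(uint64_t x, uint64_t y)
{
    uint64_t acc = 0;
    while (y) {
        if (y & 1)
            acc ^= x;
        y >>= 1;
        /* Multiply x by the indeterminate, folding x^64 back in. */
        uint64_t carry = x >> 63;
        x <<= 1;
        if (carry)
            x ^= GF64_LOW_TERMS;
    }
    return acc;
}

/* Software stand-in for PCLMULQDQ: 64 x 64 -> 128-bit carry-less product. */
void clmul64_soft(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t h = 0, l = 0;
    for (unsigned i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            l ^= a << i;
            if (i)
                h ^= a >> (64 - i);
        }
    }
    *hi = h;
    *lo = l;
}

/* Same reduction as in gf2q_mul: 6 shifts and 7 xors. */
uint64_t gf64_mul_fold(uint64_t x, uint64_t y)
{
    uint64_t hi, lo;
    clmul64_soft(x, y, &hi, &lo);
    uint64_t t = hi ^ (hi >> 63) ^ (hi >> 61) ^ (hi >> 60);
    return lo ^ t ^ (t << 1) ^ (t << 3) ^ (t << 4);
}
```

Both routines should agree on every input, which gives a simple regression test for the intrinsic-based version as well.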


B Running time data 


In this section we include exact running times used to generate the charts in the main part of 
the paper. 

Running times of various algorithms for a graph with exactly one witness (data for Figure [1],
upper left). Each running time is the median of 5 runs for the same input instance. The input
graph has 1000 vertices and $k \in \{6, \ldots, 18\}$. A dash marks an entry with no reported value.

k             6        7        8        9       10       11       12       13       14       15       16       17       18
fifo       0.05     0.15      0.3     0.77     1.65     3.37     8.08    15.69    38.54    82.44    113.2   227.11   610.27
HKLR       0.16     0.51     1.04     3.04     8.94    16.22    36.44    87.22   158.37    386.4   770.37  1585.88  3866.21
D & C      1.52     7.11    22.43    247.5   711.34  3164.32 12015.78        -        -        -        -        -        -

Running times of various algorithms for a graph with exactly one witness (data for Figure [1],
upper right). Each running time is the median of 5 runs for the same input instance. The size
of the pattern is $k = 14$ and the number of vertices $n$ varies.

n           100      250      500     1000     2500     5000    10000
fifo       2.93     6.12    19.22    39.43     68.9   113.19   239.13
HKLR      19.66    68.73    78.84   179.64   348.33   625.68  1326.63
Running times of various algorithms for a graph with $\Omega(n^2)$ witnesses (data for Figure [1],
lower left). Each running time is the median of 5 runs for the same input instance. The input
graph has 1000 vertices and $k \in \{7, 9, \ldots, 17\}$. A dash marks an entry with no reported value.

k             7        9       11       13       15       17
fifo       0.01     0.08     0.38     1.86     7.93    38.55
HKLR       0.16     1.05     6.64    46.09   228.02        -
D & C      4.61   189.48  3258.19        -        -        -

Running times of various algorithms for a graph with $\Omega(n^2)$ witnesses (data for Figure [1],
lower right). Each running time is the median of 5 runs for the same input instance. The size
of the pattern is $k = 15$ and the number of vertices $n$ varies. A dash marks an entry with no
reported value.

n           100      250      500     1000     2500     5000    10000
fifo       1.04     1.88     3.92     7.94    21.02    39.35    78.13
HKLR      33.68    64.77   135.93   235.23   612.73  1771.58        -


Experiments for the graph motif problem (data for Figure [2], left). Each running time is
the median of 5 runs for the same input instance. The input graph has 8000 vertices, 32000
edges and $k \in \{6, \ldots, 14\}$.

k              6        7        8        9       10       11       12       13       14
time (s)    0.28     0.78     1.93     5.03    13.81    31.69    84.88   198.55   430.44


Experiments for the graph motif problem (data for Figure [2], right). Each running time is
the median of 5 runs for the same input instance. The size of the motif is fixed at $k = 14$ and
the number of vertices $n$ varies. The number of edges is always $m = 4n$.

n            100      250      500     1000     2500     5000    10000
time (s)   20.22    22.71    67.71    81.96   168.73   264.78   640.06


Comparison of three implementations of GF($2^q$) arithmetic (data for Figure [3], left). A
single run of the k-path decision oracle for instances with no solution. The pattern size is fixed at
$k = 16$ and the number of vertices $n$ varies.

n                            128      256      512     1024     2048     4096     8192
lookup, GF($2^7$)           2.24     4.55     9.04     18.3    36.41    71.61   141.61
lookup ×10, GF($2^7$)       22.4     45.5     90.4    183.0    364.1    716.1   1416.1
naive, GF($2^{26}$)        28.42    56.65   114.48   227.37   455.36   908.72   1817.9
clmul, GF($2^{64}$)         9.64     17.2    34.73    70.08   146.06   274.27   544.51


Comparison of three implementations of GF($2^q$) arithmetic (data for Figure [3], right). The fifo
algorithm using the k-path decision oracle for instances with exactly one solution (each running
time is the median of 5 runs for the same input instance). The pattern size is
fixed at $k = 12$ and the number of vertices $n$ varies.

n                            128      256      512     1024     2048     4096     8192    16384
clmul, GF($2^{64}$)         2.82     5.39     9.29    18.08    36.81    75.91   151.57   343.96
naive, GF($2^5$)            1.93     3.32     5.66    10.87    22.83    48.16   100.77   197.77
lookup, GF($2^7$)           0.67     1.21     2.52     4.81     9.85    21.13    43.33    83.54